{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating a classification model ([video #9](https://www.youtube.com/watch?v=85dtiMz9tSo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=9))\n",
    "\n",
    "Created by [Data School](http://www.dataschool.io/). Watch all 9 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
    "\n",
    "**Note:** This notebook uses Python 3.6 and scikit-learn 0.19.1. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the [archive branch](https://github.com/justmarkham/scikit-learn-videos/tree/archive)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Agenda\n",
    "\n",
    "- What is the purpose of **model evaluation**, and what are some common evaluation procedures?\n",
    "- What is the usage of **classification accuracy**, and what are its limitations?\n",
    "- How does a **confusion matrix** describe the performance of a classifier?\n",
    "- What **metrics** can be computed from a confusion matrix?\n",
    "- How can you adjust classifier performance by **changing the classification threshold**?\n",
    "- What is the purpose of an **ROC curve**?\n",
    "- How does **Area Under the Curve (AUC)** differ from classification accuracy?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review of model evaluation\n",
    "\n",
    "- Need a way to choose between models: different model types, tuning parameters, and features\n",
    "- Use a **model evaluation procedure** to estimate how well a model will generalize to out-of-sample data\n",
    "- Requires a **model evaluation metric** to quantify the model performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model evaluation procedures\n",
    "\n",
    "1. **Training and testing on the same data**\n",
    "    - Rewards overly complex models that \"overfit\" the training data and won't necessarily generalize\n",
    "2. **Train/test split**\n",
    "    - Split the dataset into two pieces, so that the model can be trained and tested on different data\n",
    "    - Better estimate of out-of-sample performance, but still a \"high variance\" estimate\n",
    "    - Useful due to its speed, simplicity, and flexibility\n",
    "3. **K-fold cross-validation**\n",
    "    - Systematically create \"K\" train/test splits and average the results together\n",
    "    - Even better estimate of out-of-sample performance\n",
    "    - Runs \"K\" times slower than train/test split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model evaluation metrics\n",
    "\n",
    "- **Regression problems:** Mean Absolute Error, Mean Squared Error, Root Mean Squared Error\n",
    "- **Classification problems:** Classification accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification accuracy\n",
    "\n",
    "[Pima Indians Diabetes dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) originally from the UCI Machine Learning Repository"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read the data into a pandas DataFrame\n",
    "import pandas as pd\n",
    "path = 'data/pima-indians-diabetes.data'\n",
    "col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']\n",
    "pima = pd.read_csv(path, header=None, names=col_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnant</th>\n",
       "      <th>glucose</th>\n",
       "      <th>bp</th>\n",
       "      <th>skin</th>\n",
       "      <th>insulin</th>\n",
       "      <th>bmi</th>\n",
       "      <th>pedigree</th>\n",
       "      <th>age</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label\n",
       "0         6      148  72    35        0  33.6     0.627   50      1\n",
       "1         1       85  66    29        0  26.6     0.351   31      0\n",
       "2         8      183  64     0        0  23.3     0.672   32      1\n",
       "3         1       89  66    23       94  28.1     0.167   21      0\n",
       "4         0      137  40    35      168  43.1     2.288   33      1"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 5 rows of data\n",
    "pima.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question:** Can we predict the diabetes status of a patient given their health measurements?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define X and y\n",
    "feature_cols = ['pregnant', 'insulin', 'bmi', 'age']\n",
    "X = pima[feature_cols]\n",
    "y = pima.label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split X and y into training and testing sets\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a logistic regression model on the training set\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "logreg = LogisticRegression()\n",
    "logreg.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make class predictions for the testing set\n",
    "y_pred_class = logreg.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Classification accuracy:** percentage of correct predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6927083333333334\n"
     ]
    }
   ],
   "source": [
    "# calculate accuracy\n",
    "from sklearn import metrics\n",
    "print(metrics.accuracy_score(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    130\n",
       "1     62\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the class distribution of the testing set (using a Pandas Series method)\n",
    "y_test.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3229166666666667"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the percentage of ones\n",
    "y_test.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6770833333333333"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the percentage of zeros\n",
    "1 - y_test.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6770833333333333"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate null accuracy (for binary classification problems coded as 0/1)\n",
    "max(y_test.mean(), 1 - y_test.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.677083\n",
       "Name: label, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate null accuracy (for multi-class classification problems)\n",
    "y_test.value_counts().head(1) / len(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing the **true** and **predicted** response values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]\n",
      "Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]\n"
     ]
    }
   ],
   "source": [
    "# print the first 25 true and predicted responses\n",
    "print('True:', y_test.values[0:25])\n",
    "print('Pred:', y_pred_class[0:25])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Conclusion:**\n",
    "\n",
    "- Classification accuracy is the **easiest classification metric to understand**\n",
    "- But, it does not tell you the **underlying distribution** of response values\n",
    "- And, it does not tell you what **\"types\" of errors** your classifier is making"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion matrix\n",
    "\n",
    "Table that describes the performance of a classification model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[118  12]\n",
      " [ 47  15]]\n"
     ]
    }
   ],
   "source": [
    "# IMPORTANT: first argument is true values, second argument is predicted values\n",
    "print(metrics.confusion_matrix(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Small confusion matrix](images/09_confusion_matrix_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Every observation in the testing set is represented in **exactly one box**\n",
    "- It's a 2x2 matrix because there are **2 response classes**\n",
    "- The format shown here is **not** universal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Basic terminology**\n",
    "\n",
    "- **True Positives (TP):** we *correctly* predicted that they *do* have diabetes\n",
    "- **True Negatives (TN):** we *correctly* predicted that they *don't* have diabetes\n",
    "- **False Positives (FP):** we *incorrectly* predicted that they *do* have diabetes (a \"Type I error\")\n",
    "- **False Negatives (FN):** we *incorrectly* predicted that they *don't* have diabetes (a \"Type II error\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]\n",
      "Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]\n"
     ]
    }
   ],
   "source": [
    "# print the first 25 true and predicted responses\n",
    "print('True:', y_test.values[0:25])\n",
    "print('Pred:', y_pred_class[0:25])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save confusion matrix and slice into four pieces\n",
    "confusion = metrics.confusion_matrix(y_test, y_pred_class)\n",
    "TP = confusion[1, 1]\n",
    "TN = confusion[0, 0]\n",
    "FP = confusion[0, 1]\n",
    "FN = confusion[1, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Large confusion matrix](images/09_confusion_matrix_2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics computed from a confusion matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Classification Accuracy:** Overall, how often is the classifier correct?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6927083333333334\n",
      "0.6927083333333334\n"
     ]
    }
   ],
   "source": [
    "print((TP + TN) / float(TP + TN + FP + FN))\n",
    "print(metrics.accuracy_score(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Classification Error:** Overall, how often is the classifier incorrect?\n",
    "\n",
    "- Also known as \"Misclassification Rate\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.3072916666666667\n",
      "0.30729166666666663\n"
     ]
    }
   ],
   "source": [
    "print((FP + FN) / float(TP + TN + FP + FN))\n",
    "print(1 - metrics.accuracy_score(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sensitivity:** When the actual value is positive, how often is the prediction correct?\n",
    "\n",
    "- How \"sensitive\" is the classifier to detecting positive instances?\n",
    "- Also known as \"True Positive Rate\" or \"Recall\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.24193548387096775\n",
      "0.24193548387096775\n"
     ]
    }
   ],
   "source": [
    "print(TP / float(TP + FN))\n",
    "print(metrics.recall_score(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Specificity:** When the actual value is negative, how often is the prediction correct?\n",
    "\n",
    "- How \"specific\" (or \"selective\") is the classifier in predicting positive instances?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9076923076923077\n"
     ]
    }
   ],
   "source": [
    "print(TN / float(TN + FP))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.09230769230769231\n"
     ]
    }
   ],
   "source": [
    "print(FP / float(TN + FP))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Precision:** When a positive value is predicted, how often is the prediction correct?\n",
    "\n",
    "- How \"precise\" is the classifier when predicting positive instances?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.5555555555555556\n",
      "0.5555555555555556\n"
     ]
    }
   ],
   "source": [
    "print(TP / float(TP + FP))\n",
    "print(metrics.precision_score(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Conclusion:**\n",
    "\n",
    "- Confusion matrix gives you a **more complete picture** of how your classifier is performing\n",
    "- Also allows you to compute various **classification metrics**, and these metrics can guide your model selection\n",
    "\n",
    "**Which metrics should you focus on?**\n",
    "\n",
    "- Choice of metric depends on your **business objective**\n",
    "- **Spam filter** (positive class is \"spam\"): Optimize for **precision or specificity** because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)\n",
    "- **Fraudulent transaction detector** (positive class is \"fraud\"): Optimize for **sensitivity** because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adjusting the classification threshold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 10 predicted responses\n",
    "logreg.predict(X_test)[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.63247571, 0.36752429],\n",
       "       [0.71643656, 0.28356344],\n",
       "       [0.71104114, 0.28895886],\n",
       "       [0.5858938 , 0.4141062 ],\n",
       "       [0.84103973, 0.15896027],\n",
       "       [0.82934844, 0.17065156],\n",
       "       [0.50110974, 0.49889026],\n",
       "       [0.48658459, 0.51341541],\n",
       "       [0.72321388, 0.27678612],\n",
       "       [0.32810562, 0.67189438]])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 10 predicted probabilities of class membership\n",
    "logreg.predict_proba(X_test)[0:10, :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,\n",
       "       0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 10 predicted probabilities for class 1\n",
    "logreg.predict_proba(X_test)[0:10, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# store the predicted probabilities for class 1\n",
    "y_pred_prob = logreg.predict_proba(X_test)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# allow plots to appear in the notebook\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0,0.5,'Frequency')"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEWCAYAAABi5jCmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHXBJREFUeJzt3XmcHVWd9/HPNwlLWAMkIIalIYT9UcCwic4gIi8E2UYQGBBQEAFnABEGRAbhUWdgkEF9mEd2WWWVJQMqQiQEGLaw7xIhQGQLa9jX3/xxTidVN7e7q5u+t7o73/frdV9dy6mqX527/O451feUIgIzM7NOw+oOwMzMBhYnBjMzK3FiMDOzEicGMzMrcWIwM7MSJwYzMytxYhiEJD0kadO646iTpB0kPSPpTUnr1nD8yZL2ydO7SfpTG47ZISkkjWj1sfLxQtIqfdx2uqTNu1j3RUmPNSsr6UhJZ3Sz37bU9bzOiWGAafaGkrSXpJs75yNirYiY3MN+2vohUoOfA/8UEYtExD11BhIRF0TEFj2Vk3SMpPPbEdNAFhE3RcRqXaz7t4joTLhzvYar1rV9Mk4M1icDIOGsCDzUHzsaAOfSdvPiOVt1TgyDUEPTewNJUyXNkvSCpP/Mxabkv6/l7paNJQ2TdJSkpyS9KOlcSYsX9rtHXveypH9tOM4xki6TdL6kWcBe+di3SnpN0nOSTpY0f2F/IekASY9LekPSTySNy9vMknRJsXzDOTaNVdICkt4EhgP3SfprF9uHpAMlPSHpJUknSBqW1+0l6RZJJ0l6BTgmL/+2pEckvSrpWkkrFvb3FUmPSnpd0smACutKLTpJa0m6TtIr+Tk5UtKWwJHAzvn5uC+XXVzSmbn+/ibpp5KG53XDJf08x/8EsHWF18UPJT2cz+E3khbM6zaVNEPS4ZKeB36Tl39H0rQc60RJn27Y7VZd1OE4SX/Or5WXJF0gaVTDtut3F0sX51BsVTV7DTfW9eqFun5M0jcK67bKx38j1+2h3dWfFUSEHwPoAUwHNm9Ythdwc7MywK3AN/P0IsBGeboDCGBEYbtvA9OAlXPZy4Hz8ro1gTeBLwDzk7pqPigc55g8vz3pC8VI4HPARsCIfLxHgIMLxwtgIrAYsBbwHjApH39x4GFgzy7qoctYC/tepZt6DOAGYElgBeAvwD6F+vwQ+Occ+8h8XtOANfKyo4D/yeVHA7OAHYH5gO/n7fdpfH6ARYHngB8AC+b5DQt1eH5DnFcCpwILA0sDdwDfzev2Ax4Fls/ncUPjc9rktfNgofwtwE/zuk1zzMcDC+Rz3gx4CVgvL/t/wJSKdbgK8JW83RjSh/gvehHLjC5ez7PriOav4WJdLww8A3wrP2fr5fNZK69/Dvhinl4CWK/u9/dgedQegB8NT0h6k7wJvFZ4vE3XiWEKcCwwumE/zd5Uk4ADCvOrkT7sRwBHAxcW1i0EvN/whp3SQ+wHA1cU5gPYpDB/F3B4Yf7E4odJw766jLWw754Sw5aF+QOASXl6L+DphvJ/APYuzA/L9b4isAdwW2GdgBk0Twy7Avd0EdPsD708vwwpWY4sLNsVuCFP/xnYr7Bui8bntMlrp1h+K+CveXrT/HwuWFh/JvAfhflFch139FSHTY69ffG8K8TSH4lhZ+CmhjhOBX6cp58Gvgss1q7371B5uCtpYNo+IkZ1PkhvyK7sDawKPCrpTklf66bsp4GnCvNPkZLCMnndM50rIuJt4OWG7Z8pzkhaVdLVkp7P3Uv/Rvp2XfRCYfqdJvOL9CHWqorxPpX32WwdpATwy9wt9hrwCikBjGXuuokm23daHmjavdXEiqQWyHOF455KajnQeFzK9dGV7s55ZkS8W5gv1XFEvEl6zsf2tD9JS0u6KHfRzALOZ+7nvrtY+sOKwIaddZfrbzfgU3n910kJ6SlJN0rauJ+PP2Q5MQxyEfF4ROxK+jA5HrhM0sKkb1qNniW9mTqtQOpeeIHU7F6uc4WkkcBSjYdrmP81qatjfEQsRupDF/2ju1irWr5h+2cL843n8gypC2dU4TEyIv6HVDez9yVJDftu3M+4LtY1O+Z7pNZe5zEXi4i18vrScfM59KQ351yq4/y6WQr4W4X9/Xve32fyc787cz/33cVSRU9DPz8D3NjwnC0SEfsDRMSdEbEd6b1xJXBJL48/z3JiGOQk7S5pTER8TOp2AvgImAl8TOqj73Qh8H1JK0lahPQN/+KI+BC4DNhG0ueVLggfS88f8ouS+t7flLQ6sH+/nVj3sVZ1mKQlJC0PHARc3E3ZU4AfSloLZl8U3imvuwZYS9I/KP03z4HM+Vba6GrgU5IOVrpQvqikDfO6F4COzgu4EfEc8CfgREmLKV1wHyfp73P5S4ADJS0naQngiArn/L1cfklSou7unH8LfEvSOpIWINXx7RExvVCmqzpclNzlKWkscNgnjKWZZq/hoquBVSV9U9J8+bG+pDUkza/0m4fFI+ID0uv0o14ef57lxDD4bQk8pPSfOr8EdomId3NX0M+AW3IzeyPgLOA80nWJJ4F3SRdgiYiH8vRFpG+qbwAvkr7RduVQ4B9z2dPp/Ru/O13G2gtXka5r3Ev6cD+zq4IRcQWpxXVR7hp5EPhqXvcSsBNwHKmrZTzpYmqz/bxBuii7DfA88Djwpbz60vz3ZUl35+k9SBf7HwZeJSXoZfO604FrgfuAu0kX4HvyW1KyeSI/ftrNOU8C/hX4Hek5Hwfs0lCsqzo8lnSx9/W8vFlslWPpIr5mr+Hi+jdI1112IbVGnmfOxXWAbwLT8/O5H6lVYxUoX6QxK8nf0l8jdRM9WXc8vSUpSLFPqzuWdpE0nXRB/Pq6Y7HBzS0Gm03SNpIWyn3NPwceIP3HiJnNQ5wYrGg7UpP8WVJ3yS7hJqXZPMddSWZmVuIWg5mZlQyKgbRGjx4dHR0ddYdhZjao3HXXXS9FxJjebjcoEkNHRwdTp06tOwwzs0FFUpVfy8/FXUlmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVjIofvlsPes44ppajz/9uK1rPb6Z9R+3GMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEqcGMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzEicHMzEo8iF4/qHsAOzOz/uQWg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVODGYmVlJyxODpOGS7pF0dZ5fSdLtkh6XdLGk+Vsdg5mZVdeOFsNBwCOF+eOBkyJiPPAqsHcbYjAzs4pamhgkLQdsDZyR5wVsBlyWi5wDbN/KGMzMrHda3WL4BfAvwMd5fingtYj4MM/PAMY221DSvpKmSpo6c+bMFodpZmadWpYYJH0NeDEi7ioublI0mm0fEadFxISImDBmzJiWxGhmZnNr5f0YNgG2lbQVsCCwGKkFMUrSiNxqWA54toUxmJlZL7WsxRARP4yI5SKiA9gF+HNE7AbcAOyYi+0JXNWqGMzMrPfq+B3D4cAhkqaRrjmcWUMMZmbWhbbc2jMiJgOT8/QTwAbtOK6ZmfWef/lsZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW0pZB9MzaoeOIa2o9/vTjtq71+Gb9xS0GMzMrcWIwM7MSJwYzMytxYjAzsxInBjMzK3FiMDOzEicGMzMrcWIwM7MSJwYzMytxYjAzsxInBjMzK3FiMDOzEicGMzMrcWIwM7MSJwYzMytxYjAzsxInBjMzK/Ed3Kxf1H33NDPrP24xmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWUnLEoOkBSXdIek+SQ9JOjYvX0nS7ZIel3SxpPlbFYOZmfVeK1sM7wGbRcRngXWALSVtBBwPnBQR44FXgb1bGIOZmfVSyxJDJG/m2fnyI4DNgMvy8nOA7VsVg5mZ9V5LrzFIGi7pXuBF4Drgr8BrEfFhLjIDGNvFtvtKmipp6syZM1sZppmZFbQ0MUTERxGxDrAcsAGwRrNiXWx7WkRMiIgJY8aMaWWYZmZWUCkxSFr7kxwkIl4DJgMbAaMkdQ7FsRzw7CfZt5mZ9a+qLYZT8n8YHSBpVJUNJI3pLCtpJLA58AhwA7BjLrYncFUvYzYzsxaqlBgi4gvAbsDywFRJv5X0lR42Wxa4QdL9wJ3AdRFxNXA4cIikacBSwJl9jt7MzPpd5dFVI+JxSUcBU4FfAetKEnBkRFzepPz9wLpNlj9But5gZmYDUNVrDJ+RdBKpK2gzYJuIWCNPn9TC+MzMrM2qthhOBk4ntQ7e6VwYEc/mVoSZmQ0RVRPDVsA7EfERgKRhwIIR8XZEnNey6MzMrO2q/lfS9cDIwvxCeZmZmQ0xVRPDgoXhLcjTC7UmJDMzq1PVxPCWpPU6ZyR9Dninm/JmZjZIVb3GcDBwqaTOXykvC+zcmpDMzKxOlRJDRNwpaXVgNUDAoxHxQUsjMzOzWlT+gRuwPtCRt1lXEhFxbkuiMjOz2lRKDJLOA8YB9wIf5cUBODGYmQ0xVVsME4A1I6LpENlmZjZ0VP2vpAeBT7UyEDMzGxiqthhGAw9LuoN0L2cAImLblkRlZma1qZoYjmllEGZmNnBU/XfVGyWtCIyPiOslLQQMb21oZmZWh6rDbn8HuAw4NS8aC1zZqqDMzKw+VS8+fw/YBJgF6aY9wNKtCsrMzOpTNTG8FxHvd85IGkH6HYOZmQ0xVRPDjZKOBEbmez1fCvx368IyM7O6VE0MRwAzgQeA7wK/B3znNjOzIajqfyV9TLq15+mtDcfMzOpWdaykJ2lyTSEiVu73iMzMrFa9GSup04LATsCS/R+OmZnVrdI1hoh4ufD4W0T8AtisxbGZmVkNqnYlrVeYHUZqQSzakojMzKxWVbuSTixMfwhMB77R79GYmVntqv5X0pdaHYiZmQ0MVbuSDulufUT8Z/+EY2ZmdevNfyWtD0zM89sAU4BnWhGUmZnVpzc36lkvIt4AkHQMcGlE7NOqwMzMrB5Vh8RYAXi/MP8+0NHv0ZiZWe2qthjOA+6QdAXpF9A7AOe2LCozM6tN1f9K+pmkPwBfzIu+FRH3tC4sMzOrS9WuJICFgFkR8UtghqSVWhSTmZnVqOqtPX8MHA78MC+aDzi/VUGZmVl9qrYYdgC2Bd4CiIhn8ZAYZmZDUtXE8H5EBHnobUkL97SBpOUl3SDpEUkPSTooL19S0nWSHs9/l+h7+GZm1t+qJoZLJJ0KjJL0HeB6er5pz4fADyJiDWAj4HuS1iTdDW5SRIwHJuV5MzMbIKr+V9LP872eZwGrAUdHxHU9bPMc8FyefkPSI8BYYDtg01zsHGAy6fqFmZkNAD0mBknDgWsjYnOg22TQzT46gHWB24FlctIgIp6TtHQX2+wL7Auwwgor9OWwZmbWBz12JUXER8DbkhbvywEkLQL8Djg4ImZV3S4iTouICRExYcyYMX05tJmZ9UHVXz6/Czwg6TryfyYBRMSB3W0kaT5SUrggIi7Pi1+QtGxuLSwLvNiHuM3MrEWqJoZr8qMySQLOBB5pGJZ7IrAncFz+e1Vv9mtmZq3VbWKQtEJEPB0R5/Rh35sA3yS1NO7Ny44kJYRLJO0NPA3s1Id9m5lZi/TUYrgSWA9A0u8i4utVdxwRNwPqYvWXq+7HzMzaq6eLz8UP9pVbGYiZmQ0MPSWG6GLazMyGqJ66kj4raRap5TAyT5PnIyIWa2l0ZmbWdt0mhogY3q5AzMxsYOjN/RjMzGwe4MRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlI+oOwMz6T8cR19R6/OnHbV3r8a1/uMVgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJS1LDJLOkvSipAcLy5aUdJ2kx/PfJVp1fDMz65tWthjOBrZsWHYEMCkixgOT8ryZmQ0gLUsMETEFeKVh8XbAOXn6HGD7Vh3fzMz6pt3XGJaJiOcA8t+luyooaV9JUyVNnTlzZtsCNDOb1w3Yi88RcVpETIiICWPGjKk7HDOzeUa7E8MLkpYFyH9fbPPxzcysB+1ODBOBPfP0nsBVbT6+mZn1oGWD6Em6ENgUGC1pBvBj4DjgEkl7A08DO/XHseoeOMwM/Dq0oaNliSEidu1i1ZdbdUwzM/vkBuzFZzMzq4cTg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVODGYmVmJE4OZmZU4MZiZWYkTg5mZlTgxmJlZiRODmZmVtOxGPWY27xkId7GbftzWdYcw6LnFYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJU4MZmZW4sRgZmYlTgxmZlbixGBmZiVODGZmVuLEYGZmJbUkBklbSnpM0jRJR9QRg5mZNdf2xCBpOPBfwFeBNYFdJa3Z7jjMzKy5OloMGwDTIuKJiHgfuAjYroY4zMysiRE1HHMs8ExhfgawYWMhSfsC++bZ9yQ92IbYBoPRwEt1BzFAuC7mcF1kOt51UbBaXzaqIzGoybKYa0HEacBpAJKmRsSEVgc2GLgu5nBdzOG6mMN1MYekqX3Zro6upBnA8oX55YBna4jDzMyaqCMx3AmMl7SSpPmBXYCJNcRhZmZNtL0rKSI+lPRPwLXAcOCsiHioh81Oa31kg4brYg7XxRyuizlcF3P0qS4UMVf3vpmZzcP8y2czMytxYjAzs5IBlRh6GipD0gKSLs7rb5fU0f4oW69CPRwi6WFJ90uaJGnFOuJsh6rDp0jaUVJIGrL/plilLiR9I782HpL023bH2C4V3iMrSLpB0j35fbJVHXG2g6SzJL3Y1W+9lPwq19X9ktbrcacRMSAepAvRfwVWBuYH7gPWbChzAHBKnt4FuLjuuGuqhy8BC+Xp/YdiPVSti1xuUWAKcBswoe64a3xdjAfuAZbI80vXHXeNdXEasH+eXhOYXnfcLayPvwPWAx7sYv1WwB9IvyHbCLi9p30OpBZDlaEytgPOydOXAV+W1OwHc4NZj/UQETdExNt59jbSb0GGoqrDp/wE+A/g3XYG12ZV6uI7wH9FxKsAEfFim2Nslyp1EcBieXpxhvBvpSJiCvBKN0W2A86N5DZglKRlu9vnQEoMzYbKGNtVmYj4EHgdWKot0bVPlXoo2pv0bWAo6rEuJK0LLB8RV7czsBpUeV2sCqwq6RZJt0nasm3RtVeVujgG2F3SDOD3wD+3J7QBqbefKbUMidGVKkNlVBpOY5CrfI6SdgcmAH/f0ojq021dSBoGnATs1a6AalTldTGC1J20KakVeZOktSPitRbH1m5V6mJX4OyIOFHSxsB5uS4+bn14A06vPzcHUouhylAZs8tIGkFqInbXhBqMKg0ZImlz4EfAthHxXptia7ee6mJRYG1gsqTppP7TiUP0AnTV98dVEfFBRDwJPEZKFENNlbrYG7gEICJuBRYkDTQ4L+r1MEQDKTFUGSpjIrBnnt4R+HPkqytDSI/1kLtPTiUlhaHajww91EVEvB4RoyOiIyI6SNdbto2IPg0cNsBVeX9cSfrHBCSNJnUtPdHWKNujSl08DXwZQNIapMQws61RDhwTgT3yfydtBLweEc91t8GA6UqKLobKkPR/gakRMRE4k9QknEZqKexSX8StUbEeTgAWAS7N196fjohtawu6RSrWxTyhYl1cC2wh6WHgI+CwiHi5vqhbo2Jd/AA4XdL3Sd0mew3BL5EASLqQ1H04Ol9T+TEwH0BEnEK6xrIVMA14G/hWj/sconVlZmZ9NJC6kszMbABwYjAzsxInBjMzK3FiMDOzEicGMzMrcWKYx0n6SNK9kh6UdKmkhT7BvjaVdHWe3raH0VBHSTqgD8c4RtKhfY2xm/3Ojr0X20zPvxdoXL6fpD3y9NmSdszTZ0haM08f2R9x530dKOkRSRf0UG5y54//JP1e0qgeyr/Zyzi27zw/G9ycGOydiFgnItYG3gf2K67MP4rp9eskIiZGxHHdFBlFGi23bfKv5VsuIk6JiHObLN8nIh7Os/2WGEj1uFVE7FZ1g4jYqgVDZWxPGsnUBjknBiu6CVhFUkf+Bvr/gbuB5SVtIelWSXfnlsUiMHtc/Ecl3Qz8Q+eOJO0l6eQ8vYykKyTdlx+fB44DxuXWygm53GGS7sxjxh9b2NePlMbevx5YrVng+Zv5KZJukvQXSV8rxHGppP8G/pQT3Qm5hfSApJ0Lu1ksx/lw3tewvI9fS5qqdI+DYxsOfZikO/JjlVy+aaum8xu7pOOAkfncL5D0E0kHFcr9TNKBTbY/JMf9oKSD87JTSMNPT8w/5iqWHynpolyfFwMjC+tmt3YkXSnprnx++zbs48T8nE+SNCYvGyfpj3mbmyStnp/TbYET8nmNa1Yub79TPof7JE1p9nxazeoeS9yPeh/Am/nvCOAq0v0dOoCPgY3yutGk+x0snOcPB44mDTPwDGk8HpHGprk6l9kLODlPXwwcnKeHk8a46qAwfjywBWkMfZG+sFxNGmf+c8ADwEKkYZSnAYc2OY+zgT/mbceTxodZMMcxA1gyl/s6cF2OYxnS0AnLkn45+i7pQ3Z4LrNj3mbJQuyTgc/k+enAj/L0HoVzP6YzxhxX534mk+8X0VnveboDuDtPDyPda2CphvPrrIeFSb96fwhYtxDH6CZ1cgjpV8EAnwE+LBx/9jaF8xsJPNh5bNIvhnfL00cXns9JwPg8vSFpaJrSufZQ7gFgbJ4eVfd7wI+5HwNmSAyrzUhJ9+bpm0jDjnwaeCrS2O2QBqdbE7hFaQiO+YFbgdWBJyPicQBJ5wOlb5zZZqQPTiLiI+B1SUs0lNkiP+7J84uQPuAXBa6IfP8JSd0Ng3FJpNEzH5f0RI4P4LqI6Bxs8QvAhTmOFyTdCKwPzALuiIgn8nEuzGUvA76Rv0mPICWRNYH78/4uLPw9qZvYuhQR0yW9rDQG1jLAPTH3UBZfINXDWzm+y4EvMqe+mvk74Ff5GPdLur+LcgdK2iFPL0+q95dJXw4uzsvPBy7PLcXPM2c4FoAFGnfYQ7lbgLMlXQJc3k38VhMnBnsnItYpLshv5LeKi0gfrrs2lFuH/hv2XMC/R8SpDcc4uBfHaCzXOd94LpW3l7QScCiwfkS8KulsUkuk2TafpC7OILVuPgWc1WR9X29I1W1MkjYFNgc2joi3JU2mfH6N+xoGvNb4mmmiy3IRsZ+kDYGtgXslrdMkEVqNfI3BqrgN2KTQh76QpFWBR4GVJI3L5XbtYvtJpC4qJA2XtBjwBqk10Ola4Nuac+1irKSlSV1YO+T+8kWBbbqJcydJw3I8K5OGnW40Bdg5xzGG9K36jrxuA6URO4cBOwM3k7qv3iK1cpYBvtqwv50Lf2/tJrZGH0iarzB/BbAlqfVybRdxb5/rfmFgB1ILrztTgN0AJK1N6k5qtDjwak4Kq5Nah52GkUYxBvhH4OaImAU8KWmnvF9J+mwuM/s57a6cpHERcXtEHA28RHlIaBsA3GKwHkXETEl7ARdK6uwOOCoi/pK7WK6R9BLpg3TtJrs4CDhN0t6kUT/3j4hble409iDwh4g4TGl45Ftzi+VNYPeIuDtfOL0XeIruPwwfA24kdcfsFxHvau47v14BbEy6T3AA/xIRz+cPxVtJF8X/D+lD9YqI+FjSPaQ+/SdI3SBFC0i6nfQh2lVibOY04H5Jd0fEbhHxvqQbSN+yP2osnOvhbOYksTMiortuJIBfA7/JXUj3FrYt+iOwXy7zGOlLQKe3gLUk3UW6W2JnEtwN+LWko0ijeF5Eqs+LSCOaHkhKKF2VO0FS53WpSXmZDSAeXdWGhPyheXVEXFZ3LH2RWyl3Azt1XrMxq4u7ksxqpvSjsGnAJCcFGwjcYjAzsxK3GMzMrMSJwczMSpwYzMysxInBzMxKnBjMzKzkfwEmtpXOY7yt5wAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# histogram of predicted probabilities\n",
    "plt.hist(y_pred_prob, bins=8)\n",
    "plt.xlim(0, 1)\n",
    "plt.title('Histogram of predicted probabilities')\n",
    "plt.xlabel('Predicted probability of diabetes')\n",
    "plt.ylabel('Frequency')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Decrease the threshold** for predicting diabetes in order to **increase the sensitivity** of the classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# predict diabetes if the predicted probability is greater than 0.3\n",
    "from sklearn.preprocessing import binarize\n",
    "y_pred_class = binarize([y_pred_prob], 0.3)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,\n",
       "       0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 10 predicted probabilities\n",
    "y_pred_prob[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1., 0., 0., 1., 0., 0., 1., 1., 0., 1.])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the first 10 predicted classes with the lower threshold\n",
    "y_pred_class[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[118  12]\n",
      " [ 47  15]]\n"
     ]
    }
   ],
   "source": [
    "# previous confusion matrix (default threshold of 0.5)\n",
    "print(confusion)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[80 50]\n",
      " [16 46]]\n"
     ]
    }
   ],
   "source": [
    "# new confusion matrix (threshold of 0.3)\n",
    "print(metrics.confusion_matrix(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7419354838709677\n"
     ]
    }
   ],
   "source": [
    "# sensitivity has increased (used to be 0.24)\n",
    "print(46 / float(46 + 16))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6153846153846154\n"
     ]
    }
   ],
   "source": [
    "# specificity has decreased (used to be 0.91)\n",
    "print(80 / float(80 + 50))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Conclusion:**\n",
    "\n",
    "- **Threshold of 0.5** is used by default (for binary problems) to convert predicted probabilities into class predictions\n",
    "- Threshold can be **adjusted** to increase sensitivity or specificity\n",
    "- Sensitivity and specificity have an **inverse relationship**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ROC Curves and Area Under the Curve (AUC)\n",
    "\n",
    "**Question:** Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?\n",
    "\n",
    "**Answer:** Plot the ROC curve!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAEWCAYAAAB42tAoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XmYXGWZ9/Hvj7AKBMREhbBvwcCwyeowGlkUGAWHQQQFRcEIM+jr+oLi64IbMoMLggoqBNEI4kaEKCpDgzJhk02IIgEiCYmyBggGBHK/fzxP0SeVqtOnmz5V1d2/z3XVlTpr3fVUuu56lvMcRQRmZmbtrNTtAMzMrLc5UZiZWSknCjMzK+VEYWZmpZwozMyslBOFmZmVcqKwUUfJeZIelXR9Ta8xT9K++fnHJH274nHTJX22jpjqIKlP0rE1nXtjSUskjcvLL5N0taQnJJ0+mHK1eq3c7QBseEiaB7wMeA5YAvwSOCEilhT2eRXwWWBXYBlwNXBiRMwp7DMeOAU4BFgP+CtwKfDZiHioI2/mhdsL2A/YMCKerPvFIuLzdb8GpCQDLIiIj3fi9eoWEfcBaxVWTQMeAsaHL/DqKa5RjC5vjIi1gB2BnYCPNjZI2hP4FXAJsAGwGXArcI2kzfM+qwJXANsC+wPjgVcBDwO71RW0pOH+wbIJMG8oSaKGWKy6TYA5LzRJ5Bqlv9uGU0T4MQoewDxg38LyacBlheXfAl9vcdwvgO/m58cCfwPWGsTrbgv8GngkH/uxvH46qRbS2G8q6ddwMd4TgduAp4GPAz9qOvdXgTPy83WA7wCLgPtJNaNxLeI5BniK/prVp/P6dwNzc5wzgQ0KxwTwn8BdwL1t3udRwF9ISfPkYnkDnwK+V9j3YlJN7DFSrW3bwrbpwDdzmT0BXAVsUti+TaE87wQOy+unAc8A/8jv6+d5/QbAj4EHgXuB9xXOtRtwI/B4/my+VPI5Hgzckve9G9g/r+8Djs3PtwD+J5fBQ8D3gXUL5zgxfzZP5Nj3KYsD2DSX/cq5XIrvb98W5boH8L/AYtKPnKmFbX3A54BrgKXAlt3+mxxNj64H4McwfZDLf3FtCPwB+GpefhHpi/O1LY57J7AoP78QOH8Qr7k26Yv7Q8DqeXn3vG06AyeKW4CNgDVIvyb/Tmp2ABiXz71HXv4ZcDawJvBS4HrgPW3iOhr4XWF57/zFtjOwGvA14OrC9iB9Oa8HrNHifFPyl9er8/FfAp6lfaJ4Vy6L1YCvALcUtk3PX6SNc321EWt+b/PzZ7JyjvchcqJpUaYrAb8HPgGsCmwO3AO8Pm+fDRyVn6/VKMsW7283UlLbL59zErBN3tZHf6LYMu+zGjCRlAS/krdNzrFvkJc3BbYoi4NComjz/p4v1xzTw8CBOcb98vLEQpz3kX64rAys0u2/ydH0cPVsdPmZpCdIf7APAJ/M69cj/XEtanHMImBCfv6SNvu08wbgrxFxekQ8FRFPRMR1gzj+jIiYHxFLI+IvwE3Am/K2vYG/R8S1kl4GHAC8PyKejIgHgC8Dh1d8nbcB50bETRHxNKlJbk9Jmxb2+UJEPBIRS1scfyhwaURcnY//f6Q+npYi4txcFk+Tvux2kLROYZfLCuc6OceyEak850XEeRHxbETcRKotHNrmpXYlfVGeEhH/iIh7gG/RXy7PAFtKmhARSyLi2jbnOSaXz68jYllE3B8Rf2rxvubmfZ6OiAdJCfM1efNzpAQyRdIqETEvIu4eZBxljgRmRcSsHOOvSbWUAwv7TI+IO3LZPTOE17A2nChGlzdFxNqkX+/b0J8AHiV9sa3f4pj1Sb9aIf1Ca7VPOxuRmimGan7T8gzgiPz8rXkZUm1jFWCRpMWSFpNqFy+t+DobkJqNAIjUwf8w6Vdqu1iaj39+e6S+j4db7ShpnKRTJd0t6XFSzQn6P4vlXivH8kh+jU2A3RvvMb/PtwEvbxPXJsAGTft/jDSoAVIC2Br4k6QbJL2hzXkqfY6SXirpQkn35/f2vcb7ioi5wPtJifGBvN8Gg4yjzCbAm5ve614s//+17DO0F8CJYhSKiKtI1fj/zstPkqr/b26x+2GkDmyA3wCvl7RmxZeaT2q3buVJUpNXQ6svu+ZOy4uBqZI2BP6N/kQxn9SPMSEi1s2P8RGxbcU4F5K+aADI7+8lpPb0drEULSJ9mTaOf1E+vpW3ktr79yX1q2zaOKywT/Fca5FqfAtJ7/OqwntcNyLWiojj28Q4n9SnUtx/7Yg4ECAi7oqII0gJ9YvAj9p8tmWfY9EXcgzbR8R40q/8599XRMyIiL1IZR35NQcTR5n5wAVN73XNiDi1sI9HStXEiWL0+gqwn6Qd8/JJwDskvU/S2pJenMfz7wl8Ou9zAekP8seStpG0kqSX5PHsB674ElwKvFzS+yWtls+7e952C3CgpPUkvZz0a7NUbs7oA84jfQH+Ma9fRBqxdbqk8TmuLSS9pv3ZljMDeKekHSWtBnweuC4i5lU8/kfAGyTtlUeGnUL7v521SUntYVKibDV09sDCuT6TY5lPKs+tJR0laZX82FXSK/JxfyP1QzRcDzwu6URJa+TazHaSdgWQdKSkiRGxjNQBDKmJqNl3SOWzTy7bSZK2afPelgCLJU0CPtLYIGmypL1z+T5F6lB+bpBxlPke8EZJr8/vc3VJjR8VVjMnilEqf+l+l9SeTkT8Dng96fqIRaSmmJ2AvSLirrzP06Rfwn8ide4+TvoymgCs0PcQEU+QOhXfSBrlcxfw2rz5AtLIlHmkL/mLKoY+I8cwo2n920kdtnNITWk/omIzWURcQSqHH5Pe+xZU798gIu4gjYqakY9/FFjQZvfvksr2/hxrq/b4GaT+o0eAV5Kalxrl+boc20JSmX6R1PYP6Qt9Sm56+VlEPEcq+x1JI54eAr5NqslAGuJ8h6QlpE7zwyPiqRbv73pSB/qXSZ3aV1GogRV8mtTB/hhwGfCTwrbVgFNzDH8l1R4+Npg4yuREenA+54OkHzQfwd9hHaEI19bMzKw9Z2MzMytVW6KQdK6kByTd3ma7JJ0haa6k2yTtXFcsZmY2dHXWKKaT2ibbOQDYKj+mAd+oMRYzMxui2hJFRFxN6qxr52DS1BGRL8BZV9JgxvCbmVkHdHMCtEksf4HMgrxuhSuDJU0j1TpYffXVX7nxxht3JMBet2zZMlZayd1M4LIocln0G4tl8dcnl/GP52DVccuvf/z+uQ9FxMShnLObiUIt1rUcghUR5wDnAEyePDnuvPPOOuMaMfr6+pg6dWq3w+gJLot+Lot+Y7Es3nL2bAAues+ey62X9JdW+1fRzVS7gMIVqqSJ7BZ2KRYzM2ujmzWKmcAJki4Edgcey1fgmpl1zIzr7uOSW+4feMcRYs6ix5my/vhhPWdtiULSD0iT002QtIB0JeoqABHxTWAWaebHuaTppd9ZVyxmZu1ccsv9tXy5dsuU9cdz8I6TBt5xEGpLFHkSsLLtjZvFmJl11ZT1x6/Qpm/9fNtHM+uaupt9Fi9eyjfunF26z2iqTdRlbI0bM7Oe0mj26aY6mmpGG9cozKyr6mz2ScNj3aT0QrlGYWZmpVyjMDOgO8NE3T8wMrhGYWZAd/oL3D8wMrhGYWbP8zBRa8WJwmyEa24yqjIktBU3A1k7bnoyG+GGq8nIzUDWjmsUZqNAscnIQ0JtuLlGYWZmpZwozMyslBOFmZmVcqIwM7NS7sw2G6Eaw2I9rNXq5hqF2QhVTBIe1mp1co3CbATzldTWCa5RmJlZKScKMzMr5URhZmalnCjMzKyUO7PNekzVGwh5WKx1imsUZj2m6mywHhZrneIahVkP8rBX6yWuUZiZWSnXKMx6QLFfwn0P1mtcozDrAcV+Cfc9WK9xjcKsR7hfwnqVE4XZMKs6vLXIzU3Wy0qbniTtKumrkm6StEjSPZJmSnqPpLU7FaTZSFJ1eGuRm5usl7WtUUi6FHgYuAQ4HXgAWB3YGngtcJmk0yLi0k4EajaSuBnJRpOypqdjIuJvTeueAq7Pjy9KemltkZmZWU9omygaSULSccAPIuKxFvs8UGNsZl3n/gazasNjNwVukjRD0r41x2PWU9zfYFZh1FNEnCTpY8ABwHGSvgH8ADg3IubVHJ9Z17m/wca6SsNjI2KZpHnAPOCfgPWBSyTNioiPtjtO0v7AV4FxwLcj4tSm7RsD5wPr5n1OiohZQ3gfZkMyUNOSm5HMKjQ9SfoPSdeTvvB/D2wfEe8GdgLeUnLcOOAsUk1kCnCEpClNu30c+GFE7AQcDnx9SO/CbIgGalpyM5JZtRrFhsDhEXFPcWWuZRxUctxuwNzGcZIuBA4G5hRPAzR+rq0DLKwauNlwcdOSWbkqiWKD5iQhaXpEHB0Rt5ccNwmYX1heAOzetM+ngF9Jei+wJtCys1zSNGAawMSJE+nr66sQ9ui3ZMkSl0U21LJYvHgpwKgqR/+/6OeyGB5VEsX2xQVJKwG7VjhOLdZF0/IRwPSIOF3SnsAFkraLiGXLHRRxDnAOwOTJk2Pq1KkVXn706+vrw2WRVCmLVv0RC5c+zZT1xzN16uipUfj/RT+XxfBo20ch6URJjwLbS3okPx4FHgKqdDgvADYqLG/Iik1LxwA/BIiI2aQrvycMIn6zylr1R7gPwmxgZTWK00hTd3wBOKmxMiKeq3juG4CtJG0G3E/qrH5r0z73AfsA0yW9gpQoHqx4frNBc3+E2eCVJYotI+IuSRcA2zZWSqlFKSJuKztxRDwr6QTgctLQ13Mj4g5JpwA3RsRM4EPAtyR9gNQsdXRENDdPmZlZF5UlipNITUNntdgWwKsHOnm+JmJW07pPFJ7PAf65UqRmZtYVZXM9HZP//ZfOhWNmZr2mygV3N0n6iKRNOhGQmZn1lirDY99MugJ7pqS/AxcBF0fE4KbUNBsG7abcWLx4Kd+4c3bpsZ6Ow2xoBqxRRMTdEfH5iNgBeBfwSuAvtUdm1sJQZnNt8FBYs6GpNCmgpA2Bw0g1i5WBk+sMyqxMqyGu6cIqD3s1q8OAiULSNcDawMXAURHx59qjMjOznlGlRvGeAeZ0MjOzUaxtopB0RET8ANhb0t7N2yPijFojMzOznlBWo3hx/ndii22+etrMbIwou+CucROhyyLi2uI2SXvUGpWNOQPdaa7BQ1zNOm/A4bG0vutcq2k9zIas6rBXD3E167yyPordgD2BiZLeV9g0Hlil7sBs7PHMrma9qayPYk3SvSFWZvl+iidIV2ubmdkYUNZHcSVwpaTzmm+FamZmY0dZ09PpEfEh4HRJK4xyiohDao3MzMx6QlnT00X53zM7EYiZmfWmsqan6/O/VzTWSVoHmJRvOGRmZmNAlftRXCFpvKQXA38AZkj6r/pDMzOzXlDlOor1IuJx4BDg/IjYEXh9vWGZmVmvqJIoVpY0kTQk9uc1x2NmZj2mSqL4HHAVcF9EXC9pc+DeesMyM7NeMeA04xFxIXBhYfke4OA6gzIzs95R5cZFE0i3QN20uH9ETKsvLDMz6xVVblx0CXAt8DvguXrDsdFqoNlhPSusWe+qkijWzFdomw1ZY3bYdsnAs8Ka9a4qieIXkl4XEb+qPRob1Tw7rNnIVGXU03HALyUtkfSIpEclPVJ3YGZm1huq1Cgm1B6FmZn1rAFrFBHxHOliuxPz8/WBHesOzMzMekOVuZ7OBF4LHJVX/R34Zp1BmZlZ76jS9PSqiNhZ0s0AEfGIpFVrjsvMzHpElc7sZyStBASApJcAy2qNyszMekaVRHEW8GNgoqRPky68+2KtUZmZWc+oMtfTdyX9Htg3r3pzRNxeb1hmZtYr2tYoJK0uaRxARNwBXEZqctq86skl7S/pTklzJZ3UZp/DJM2RdIekGYOM38zMalbW9HQ5sAWApC2A64EpwAclfW6gE+ckcxZwQD7uCElTmvbZCvgo8M8RsS3w/qG8CTMzq09ZolgvIv6cn78DuDAijifd3e6gCufeDZgbEfdExD9IU5U3T0/+buCsiHgUICIeGFT0ZmZWu7I+iig83xs4HSAinpZUZdTTJGB+YXkBsHvTPlsDSLoGGAd8KiJ+2XwiSdOAaQATJ06kr6+vwsuPfkuWLOnJsuib/wyzFz673Lr7nljGxmuvVFu8vVoW3eCy6OeyGB5lieIOSacC95O+0H8FIGkdQBXO3WqfaFpeGdgKmApsCPxW0nYRsXi5gyLOAc4BmDx5ckydOrXCy49+fX199GJZfOPs2SxcuvxMseuuCwfvOImpu29cy2v2all0g8uin8tieJQlimOBDwDbAPtHxJN5/XbAlyqcewGwUWF5Q2Bhi32ujYhngHsl3UlKHDdUOL/1MM8UazZ6tE0UOTF8tsX6a4BrKpz7BmArSZuRaiWHA29t2udnwBHA9Hwnva2Be6qFbmZmnVA2PPZnkg6QtEIykbSJpE9Iele74yPiWeAE0uipPwI/jIg7JJ0iqdEZfjnwsKQ5wJXARyLi4RfyhszMbHiVNT39J/Ah4CxJfwMeBFYnXUdxH2m00o/LTh4Rs4BZTes+UXgewAfzw8zMelBZ09P95C9xSVuSphdfCtwZEU90KD4zM+uyKrPHEhFzgbk1x2JmZj2oyqSAZmY2hjlRmJlZqUqJQtKquZ/CzMzGmAH7KCT9K+kCu1WBzSTtCHwyIv6t7uCs98247j4uueX+5dbNWbT8VdlmNrJVqVGcQpqjaTFARNwCuHZhAFxyy/3MWfT4cuumrD+eg3ec1KWIzGy4VRn19ExELJaWm7qpec4mG8M8XYfZ6FYlUfxR0mHASnk6jv8DXFtvWNZNrZqT2nEzk9noV6Xp6QTglaS72/0EeIqULGyUatWc1I6bmcxGvyo1itdHxInAiY0Vkg4hJQ0bpdycZGYNVWoUH2+x7uThDsTMzHpT2xqFpNcD+wOTJBXvPzGe1AxlZmZjQFnT0wPA7aQ+iTsK658ATqozKDMz6x1ls8feDNws6fsR8VQHYzIzsx5SpTN7kqTPAVNI96MAICK2ri0qq0XVYa8e8mpmRVU6s6cD5wECDgB+CFxYY0xWk6rDXj3k1cyKqtQoXhQRl0v674i4G/i4pN/WHZjVw8NezWywqiSKp5Xm77hb0nHA/cBL6w3LzMx6RZVE8QFgLeB9wOeAdYB31RmUtdbcx7B48VK+cefsyse778HMhmLARBER1+WnTwBHAUjasM6grLVGH8NQv+zd92BmQ1GaKCTtCkwCfhcRD0naljSVx96Ak0UXFPsY+vr6mDrV/Q1mVq+2o54kfQH4PvA24JeSTgauBG4FPDS2g2Zcdx9vOXt25Yn6zMyGU1mN4mBgh4hYKmk9YGFevrMzoVlDscnJTUdm1mllieKpiFgKEBGPSPqTk0T3eFirmXVLWaLYXFJjKnEBmxaWiYhDao3MzMx6Qlmi+Pem5TPrDGQsG2hqDQ9rNbNuKpsU8IpOBjKWDTTs1X0TZtZNVS64sw5wH4SZ9Sonii5qNDm5acnMelmV2WMBkLRanYGMRR72amYjwYA1Ckm7Ad8hzfG0saQdgGMj4r11BzcWuMnJzHpdlRrFGcAbgIcBIuJW4LV1BmVmZr2jSqJYKSL+0rTuuTqCMTOz3lOlM3t+bn4KSeOA9wJ/rjcsMzPrFVVqFMcDHwQ2Bv4G7JHXDUjS/pLulDRX0kkl+x0qKSTtUuW8ZmbWOVVqFM9GxOGDPXGufZwF7AcsAG6QNDMi5jTttzbppkjXrXgWMzPrtio1ihskzZL0jvylXtVuwNyIuCci/gFcSJqRttlngNOApwZxbjMz65Aqd7jbQtKrgMOBT0u6BbgwIi4c4NBJwPzC8gJg9+IOknYCNoqISyV9uN2JJE0DpgFMnDiRvr6+gcIeERYvXgow5PezZMmSUVMWL5TLop/Lop/LYnhUujI7Iv4X+F9JnwK+Qrqh0UCJQq1O9fxGaSXgy8DRFV7/HOAcgMmTJ8fUqVOrhN3zGve7Hupd6tId7qYOY0Qjl8uin8uin8tieFS54G4tUpPR4cArgEuAV1U49wJgo8LyhqSbHzWsDWwH9EkCeDkwU9JBEXFjpehHoOJMsZ66w8xGgio1ituBnwOnRcRvB3HuG4CtJG0G3E9KNG9tbIyIx4AJjWVJfcCHR3OSgOWn7fDUHWY2ElRJFJtHxLLBnjginpV0AnA5MA44NyLukHQKcGNEzBzsOUcLT9thZiNJ20Qh6fSI+BDwY0nRvL3KHe4iYhYwq2ndJ9rsO3XAaM3MrOPKahQX5X99ZzszszGs7A531+enr4iI5ZJFblLyHfDMzMaAKhfcvavFumOGOxAzM+tNZX0UbyGNVNpM0k8Km9YGFtcdmJmZ9YayPorrSfeg2JA0Z1PDE8DNdQZlZma9o6yP4l7gXuA3nQvHzMx6TVnT01UR8RpJj1KYeoM0NUdExHq1RzdK+GpsMxvJypqeGrc7nVCyj1Xgq7HNbCQra3pqXI29EbAwIv4haS9ge+B7wOMdiG/U8NXYZjZSVRke+zPSbVC3AL5LmhhwRq1RmZlZz6gy19OyiHhG0iHAVyLiDEke9dRGsT+iwf0SZjaSValRPCvpzcBRwKV53Sr1hTSyNfojitwvYWYjWZUaxbuA/yBNM35Pnjb8B/WGNbK5P8LMRpMqt0K9XdL7gC0lbUO6D/bn6g/NzMx6QZU73P0LcAHp5kMCXi7pqIi4pu7gzMys+6o0PX0ZODAi5gBIegUpcexSZ2BmZtYbqnRmr9pIEgAR8Udg1fpCMjOzXlKlRnGTpLNJtQiAt+FJAVfQGBbrobBmNtpUSRTHAe8D/i+pj+Jq4Gt1BjUSFZOEh8Ka2WhSmigk/ROwBfDTiDitMyGNXB4Wa2ajUdnssR8j3cnuJmBXSadExLkdi6zHtLriushNTmY2WpV1Zr8N2D4i3gzsChzfmZB6U6srrovc5GRmo1VZ09PTEfEkQEQ8KKnKCKlRzU1LZjYWlSWKzQv3yhawRfHe2RFxSK2RmZlZTyhLFP/etHxmnYGYmVlvKrtx0RWdDMTMzHrTmO93MDOzclUuuBuzikNiPfzVzMaqyjUKSavVGUgvKg6J9fBXMxurqkwzvhvwHWAdYGNJOwDHRsR76w6uF3hIrJmNdVVqFGcAbwAeBoiIW4HX1hmUmZn1jip9FCtFxF8kFdc9V1M8PcEzwZqZ9auSKObn5qeQNA54L/DnesPqLs8Ea2bWr0qiOJ7U/LQx8DfgN4yBeZ/cN2FmlgyYKCLiAeDwoZxc0v7AV4FxwLcj4tSm7R8EjgWeBR4E3hURfxnKa71QHgprZtZalVFP3wKieX1ETBvguHHAWcB+wALgBkkzi7dVJd0pb5eI+Luk44HTgLcMIv5hU2xucpOTmVm/Kk1Pvyk8Xx34N2B+heN2A+ZGxD0Aki4EDgaK99++srD/tcCRFc5bGzc3mZmtqErT00XFZUkXAL+ucO5JLJ9QFgC7l+x/DPCLVhskTQOmAUycOJG+vr4KLz84ixcvBajl3HVZsmTJiIq3Ti6Lfi6Lfi6L4TGUKTw2AzapsJ9arFuhCQtA0pHALsBrWm2PiHOAcwAmT54cU6dOrRQoDHxnuoaFS59myvrjmTp15NQo+vr6GExZjGYui34ui34ui+FRpY/iUfq/4FcCHgFOqnDuBcBGheUNgYUtzr8vcDLwmoh4usJ5B6Xq9RDulzAza600UShdZbcD0PhJviwiWtYKWrgB2ErSZvn4w4G3Np1/J+BsYP88uqoW7nswMxu60kQRESHppxHxysGeOCKelXQCcDlpeOy5EXGHpFOAGyNiJvBfwFrAxfnK7/si4qBBv4smHupqZjZ8qvRRXC9p54i4abAnj4hZwKymdZ8oPN93sOeswkNdzcyGT9tEIWnliHgW2At4t6S7gSdJndQRETt3KMYhcXOTmdnwKKtRXA/sDLypQ7GYmVkPKksUAoiIuzsUi5mZ9aCyRDExz8XUUkR8qYZ4zMysx5QlinGkEUmtLpwzM7MxoixRLIqIUzoWiZmZ9aSyW6G6JmFmZqWJYp+ORWFmZj2rbaKIiEc6GYiZmfWmshqFmZmZE4WZmZVzojAzs1JOFGZmVsqJwszMSjlRmJlZKScKMzMrVeXGRSNG4852vqudmdnwGVU1imKS8F3tzMyGx6iqUYDvbGdmNtxGfKJoNDcBbnIyM6vBiG96ajQ3AW5yMjOrwYivUYCbm8zM6jTiaxRmZlavEVuj8FBYM7POGLE1Cg+FNTPrjBFbowD3TZiZdcKIrVGYmVlnOFGYmVkpJwozMyvlRGFmZqWcKMzMrJQThZmZlXKiMDOzUk4UZmZWyonCzMxK1ZooJO0v6U5JcyWd1GL7apIuytuvk7RpnfGYmdng1ZYoJI0DzgIOAKYAR0ia0rTbMcCjEbEl8GXgi3XFY2ZmQ1NnjWI3YG5E3BMR/wAuBA5u2udg4Pz8/EfAPpJUdtK/PrmMt5w9+/mbFZmZWb3qnBRwEjC/sLwA2L3dPhHxrKTHgJcADxV3kjQNmJYXn/7hca+6HeB24IfHDX/gI8gEmspqDHNZ9HNZ9HNZ9Js81APrTBStagYxhH2IiHOAcwAk3RgRu7zw8EY+l0U/l0U/l0U/l0U/STcO9dg6m54WABsVljcEFrbbR9LKwDrAIzXGZGZmg1RnorgB2ErSZpJWBQ4HZjbtMxN4R35+KPA/EbFCjcLMzLqntqan3OdwAnA5MA44NyLukHQKcGNEzAS+A1wgaS6pJnF4hVOfU1fMI5DLop/Lop/Lop/Lot+Qy0L+AW9mZmV8ZbaZmZVyojAzs1I9myg8/Ue/CmXxQUlzJN0m6QpJm3Qjzk4YqCwK+x0qKSSN2qGRVcpC0mH5/8YdkmZ0OsZOqfA3srGkKyXdnP9ODuxGnHWTdK6kByTd3ma7JJ2Ry+k2STtXOnFE9NyD1Pl9N7A5sCpwKzClaZ//AL6Znx8OXNTtuLtYFq8FXpSfHz+WyyLvtzZwNXAtsEu34+7i/4utgJuBF+fll3Y77i6WxTnA8fn5FGBet+OuqSxeDewM3N5m+4HAL0jXsO0BXFflvL1ao6hl+o8RasCyiIgrI+LvefFa0jUro1GV/xcAnwGrsd4pAAAIIUlEQVROA57qZHAdVqUs3g2cFRGPAkTEAx2OsVOqlEUA4/PzdVjxmq5RISKupvxatIOB70ZyLbCupPUHOm+vJopW039MardPRDwLNKb/GG2qlEXRMaRfDKPRgGUhaSdgo4i4tJOBdUGV/xdbA1tLukbStZL271h0nVWlLD4FHClpATALeG9nQus5g/0+AeqdwuOFGLbpP0aByu9T0pHALsBrao2oe0rLQtJKpFmIj+5UQF1U5f/FyqTmp6mkWuZvJW0XEYtrjq3TqpTFEcD0iDhd0p6k67e2i4hl9YfXU4b0vdmrNQpP/9GvSlkgaV/gZOCgiHi6Q7F12kBlsTawHdAnaR6pDXbmKO3Qrvo3cklEPBMR9wJ3khLHaFOlLI4BfggQEbOB1UkTBo41lb5PmvVqovD0H/0GLIvc3HI2KUmM1nZoGKAsIuKxiJgQEZtGxKak/pqDImLIk6H1sCp/Iz8jDXRA0gRSU9Q9HY2yM6qUxX3APgCSXkFKFA92NMreMBN4ex79tAfwWEQsGuignmx6ivqm/xhxKpbFfwFrARfn/vz7IuKgrgVdk4plMSZULIvLgddJmgM8B3wkIh7uXtT1qFgWHwK+JekDpKaWo0fjD0tJPyA1NU7I/TGfBFYBiIhvkvpnDgTmAn8H3lnpvKOwrMzMbBj1atOTmZn1CCcKMzMr5URhZmalnCjMzKyUE4WZmZVyohijJD0n6ZbCY9OSfTdtNxvlIF+zL8/weWueVmLyEM5xnKS35+dHS9qgsO3bkqYMc5w3SNqxwjHvl/SiIbzWVyS9Oj8/Ic/qGfm6h8Gea3KO/RZJf5Q0rHd3k3RQY2ZWSROVZm2+WdK/SJolad2SY9t+biXH/EbSi4fvHdiQdXu2Qz+68wCWDGLfTWkzG+UgX7OPPJsrMA2YOVznG+ayKcb5TuDXFY6ZB0wY5OusB1xbWN4pl/Wgz5WPvxw4uLD8TzX+/zkcOL/Oz410Qe3Jdb0HP6o/XKOw5+Waw28l3ZQfr2qxz7aSrs+/Wm+TtFVef2Rh/dmSxg3wclcDW+Zj98m/TP+gNJ/+ann9qeq/z8Z/53WfkvRhSYeS5rX6fn7NNfKv6V0kHS/ptELMR0v62hDjnE1h0jRJ35B0o9L9HT6d170P2AC4UtKVed3rJM3O5XixpLVanPtQ4JeNhYi4OSLmDRBPmfVJUzQ0zveHHMvRki6R9MtcU/pk4f20LA+l+zvclGtVVxTOc2auYZ0GHFgo+3mNWpCkt+fP7FZJF+R17T63f5X000I8+0n6SV6cSZqjybqt25nKj+48SFfq3pIfP83rXgSsnp9vRbqqFQo1CuBrwNvy81WBNYBXAD8HVsnrvw68vcVr9tH/S/0jwEWkqRTmA1vn9d8F3k/6tX0n/ReFrpv//RTw4ebzFZeBiaRppxvrfwHsNcQ43w98vrBtvfzvuLzf9nl5HrkWQJpD6Gpgzbx8IvCJFq9zPvDGFuufP9cgP9N3kmZR/gXwgUKZHQ0sIs2uvAZwey6nluWRy28+sFnTez4aOLP5eTFmYNv8uU1oOrbl50aapO5PwMS8PKNYJsBdwEu6/fcy1h89OYWHdcTSiGhue18FaPxifI40N1Cz2cDJkjYEfhIRd0naB3glcIPSFCJrAO3mnPq+pKWkL5b3ApOBeyPiz3n7+cB/AmeS7ifxbUmXAZWnDY+IByXdozSXzV35Na7J5x1MnGuSEkLxLmCHSZpGmv5mfdJNcG5rOnaPvP6a/Dqrksqt2foM43xDEXGepMuB/Un3HXiPpB3y5l9Hnr4j/2LfC3iW1uWxB3B1pIkEiYjBTLa5N/CjiHioyrEREbnWcaSk84A9Scmq4QFSbW3UTT0ykjhRWNEHgL8BO5AGOqxw45+ImCHpOuBfgcslHUv6VXh+RHy0wmu8LQqT9ElqeQ+RSPP37EaayO1w4ATSl1BVFwGHkX6t/jR/IQ0qTtKd0k4FzgIOkbQZ8GFg14h4VNJ0Uo2omUhfzAM1myxtc3xb+ct0J2BhRKxwO8+IWAicC5yrNABhu8am5l1p87lJOqjF/pVDHMKx55FqNk8BF0e6v0zD6qRysi5yH4UVrQMsijRH/1GkX9PLkbQ5cE9EnEFqQ94euAI4VNJL8z7rqfp9u/8EbCppy7x8FHBVbtNfJyJmkZp/Wo08eoI0tXgrPwHeRGrjviivG1ScEfEM8HFgD6UZR8cDTwKPSXoZcECbWK4F/rnxniS9SFKr2tkfyf00VUXEOyNix1ZJIvcrrJKfv5zU1HR/3rxffr9rkMrlGtqXx2zgNTkxImm9QYR4BanW9ZKSY5f73HJyW0gq6+mF9yPg5aTap3WRE4UVfR14h6RrSc1OT7bY5y3A7ZJuAbYh3VZxDumP/FeSbgN+TWpWGVBEPEVqW79Y0h+AZcA3SV8kl+bzXUWq7TSbDnyz0aHadN5HgTnAJhFxfV436DgjYilwOql9/VbSPajvIP1qv6aw6znALyRdGREPktrwf5Bf51pSWTW7jDTTJ5A6xZVm/NwQuE3St8tia+F1pM/mVtIIqI9ExF/ztt8BF5D6pH4cETe2K48c/zTgJ/lcFzW/UDsRcQfwOVKyvxX4UovdprPi5/Z9YH6OqeGVpFFhzzafwDrLs8eadZGk3wFviBrvOifpaFLn8Ql1vcYLJelM4OaI+E5h3VdJQ6iv6F5kBq5RmHXbh4CNux1EN0n6PakJ83tNm253kugNrlGYmVkp1yjMzKyUE4WZmZVyojAzs1JOFGZmVsqJwszMSv1/XvdgAxxw26gAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# IMPORTANT: first argument is true values, second argument is predicted probabilities\n",
    "fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)\n",
    "plt.plot(fpr, tpr)\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.0])\n",
    "plt.title('ROC curve for diabetes classifier')\n",
    "plt.xlabel('False Positive Rate (1 - Specificity)')\n",
    "plt.ylabel('True Positive Rate (Sensitivity)')\n",
    "plt.grid(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- ROC curve can help you to **choose a threshold** that balances sensitivity and specificity in a way that makes sense for your particular context\n",
    "- You can't actually **see the thresholds** used to generate the curve on the ROC curve itself"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a function that accepts a threshold and prints sensitivity and specificity\n",
    "def evaluate_threshold(threshold):\n",
    "    print('Sensitivity:', tpr[thresholds > threshold][-1])\n",
    "    print('Specificity:', 1 - fpr[thresholds > threshold][-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sensitivity: 0.24193548387096775\n",
      "Specificity: 0.9076923076923077\n"
     ]
    }
   ],
   "source": [
    "evaluate_threshold(0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sensitivity: 0.7258064516129032\n",
      "Specificity: 0.6153846153846154\n"
     ]
    }
   ],
   "source": [
    "evaluate_threshold(0.3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AUC is the **percentage** of the ROC plot that is **underneath the curve**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7245657568238213\n"
     ]
    }
   ],
   "source": [
    "# IMPORTANT: first argument is true values, second argument is predicted probabilities\n",
    "print(metrics.roc_auc_score(y_test, y_pred_prob))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- AUC is useful as a **single number summary** of classifier performance.\n",
    "- If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a **higher predicted probability** to the positive observation.\n",
    "- AUC is useful even when there is **high class imbalance** (unlike classification accuracy)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7378233618233618"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate cross-validated AUC\n",
    "from sklearn.model_selection import cross_val_score\n",
    "cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Confusion matrix advantages:**\n",
    "\n",
    "- Allows you to calculate a **variety of metrics**\n",
    "- Useful for **multi-class problems** (more than two response classes)\n",
    "\n",
    "**ROC/AUC advantages:**\n",
    "\n",
    "- Does not require you to **set a classification threshold**\n",
    "- Still useful when there is **high class imbalance**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion Matrix Resources\n",
    "\n",
    "- Blog post: [Simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by me\n",
    "- Videos: [Intuitive sensitivity and specificity](https://www.youtube.com/watch?v=U4_3fditnWg) (9 minutes) and [The tradeoff between sensitivity and specificity](https://www.youtube.com/watch?v=vtYDyGGeQyo) (13 minutes) by Rahul Patwari\n",
    "- Notebook: [How to calculate \"expected value\"](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb) from a confusion matrix by treating it as a cost-benefit matrix (by Ed Podojil)\n",
    "- Graphic: How [classification threshold](https://media.amazonwebservices.com/blog/2015/ml_adjust_model_1.png) affects different evaluation metrics (from a [blog post](https://aws.amazon.com/blogs/aws/amazon-machine-learning-make-data-driven-decisions-at-scale/) about Amazon Machine Learning)\n",
    "\n",
    "\n",
    "## ROC and AUC Resources\n",
    "\n",
    "- Video: [ROC Curves and Area Under the Curve](https://www.youtube.com/watch?v=OAl6eAyP-yo) (14 minutes) by me, including [transcript and screenshots](http://www.dataschool.io/roc-curves-and-auc-explained/) and a [visualization](http://www.navan.name/roc/)\n",
    "- Video: [ROC Curves](https://www.youtube.com/watch?v=21Igj5Pr6u4) (12 minutes) by Rahul Patwari\n",
    "- Paper: [An introduction to ROC analysis](http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf) by Tom Fawcett\n",
    "- Usage examples: [Comparing different feature sets](http://research.microsoft.com/pubs/205472/aisec10-leontjeva.pdf) for detecting fraudulent Skype users, and [comparing different classifiers](http://www.cse.ust.hk/nevinZhangGroup/readings/yi/Bradley_PR97.pdf) on a number of popular datasets\n",
    "\n",
    "## Other Resources\n",
    "\n",
    "- scikit-learn documentation: [Model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html)\n",
    "- Guide: [Comparing model evaluation procedures and metrics](https://github.com/justmarkham/DAT8/blob/master/other/model_evaluation_comparison.md) by me\n",
    "- Video: [Counterfactual evaluation of machine learning models](https://www.youtube.com/watch?v=QWCSxAKR-h0) (45 minutes) about how Stripe evaluates its fraud detection model, including [slides](http://www.slideshare.net/MichaelManapat/counterfactual-evaluation-of-machine-learning-models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comments or Questions?\n",
    "\n",
    "- Email: <kevin@dataschool.io>\n",
    "- Website: http://dataschool.io\n",
    "- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "    @font-face {\n",
       "        font-family: \"Computer Modern\";\n",
       "        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
       "    }\n",
       "    div.cell{\n",
       "        width: 90%;\n",
       "/*        margin-left:auto;*/\n",
       "/*        margin-right:auto;*/\n",
       "    }\n",
       "    ul {\n",
       "        line-height: 145%;\n",
       "        font-size: 90%;\n",
       "    }\n",
       "    li {\n",
       "        margin-bottom: 1em;\n",
       "    }\n",
       "    h1 {\n",
       "        font-family: Helvetica, serif;\n",
       "    }\n",
       "    h4{\n",
       "        margin-top: 12px;\n",
       "        margin-bottom: 3px;\n",
       "       }\n",
       "    div.text_cell_render{\n",
       "        font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
       "        line-height: 145%;\n",
       "        font-size: 130%;\n",
       "        width: 90%;\n",
       "        margin-left:auto;\n",
       "        margin-right:auto;\n",
       "    }\n",
       "    .CodeMirror{\n",
       "            font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
       "    }\n",
       "/*    .prompt{\n",
       "        display: None;\n",
       "    }*/\n",
       "    .text_cell_render h5 {\n",
       "        font-weight: 300;\n",
       "        font-size: 16pt;\n",
       "        color: #4057A1;\n",
       "        font-style: italic;\n",
       "        margin-bottom: 0.5em;\n",
       "        margin-top: 0.5em;\n",
       "        display: block;\n",
       "    }\n",
       "\n",
       "    .warning{\n",
       "        color: rgb( 240, 20, 20 )\n",
       "        }\n",
       "</style>\n",
       "<script>\n",
       "    MathJax.Hub.Config({\n",
       "                        TeX: {\n",
       "                           extensions: [\"AMSmath.js\"]\n",
       "                           },\n",
       "                tex2jax: {\n",
       "                    inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
       "                    displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
       "                },\n",
       "                displayAlign: 'center', // Change this to 'center' to center equations.\n",
       "                \"HTML-CSS\": {\n",
       "                    styles: {'.MathJax_Display': {\"margin\": 4}}\n",
       "                }\n",
       "        });\n",
       "</script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.core.display import HTML\n",
    "def css_styling():\n",
    "    styles = open(\"styles/custom.css\", \"r\").read()\n",
    "    return HTML(styles)\n",
    "css_styling()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
